(54) Voice processing



(57) A voice synthesizing unit performs voice synthesizing
processing, based on the state of emotion of
a robot at an emotion/instinct model unit. For example,
in the event that the emotion state of the robot
represents "not angry", synthesized sound of "What is it?" is
generated at the voice synthesizing unit. On the other
hand, in the event that the emotion state of the robot
represents "angry", synthesized sound of "Yeah, what?"
is generated at the voice synthesizing unit, to express
the anger. Thus, a robot with a high entertainment
nature is provided.
Description

[0001] The present invention relates to a voice
processing device, voice processing method, and
recording medium, and particularly (though not
exclusively) relates to a voice processing device, voice
processing method, and recording medium suitably used for a
robot having voice processing functions such as voice
recognition, voice synthesizing, and so forth.
[0002] Heretofore, many robots which output synthesized
sound when a touch switch is pressed (the definition
of such robots in the present specification includes
stuffed animals and the like) have been marketed as toy
products.

[0003] However, with conventional robots, the relation
between the pressing operation of the touch switch and
synthesized sound is fixed, so there has been the
problem that the user gets tired of the robot.
[0004] Various respective aspects and features of the 
invention are defined in the appended claims.
[0005] The present invention has been made in light 
of such, and accordingly, it is an object thereof to provide 
a robot with a high entertainment factor. 
[0006] A voice processing device according to the 
present invention comprises: voice processing means
for processing voice; and control means for controlling 
voice processing by the voice processing means, based 
on the state of the robot. 

[0007] The control means may control the voice
processing based on the state of actions, emotions or instincts
of the robot. The voice processing means may comprise
voice synthesizing means for performing voice
synthesizing processing and outputting synthesized sound,
and the control means may control the voice
synthesizing processing by the voice synthesizing means, based
on the state of the robot.

[0008] The control means may control phonemics in- 
formation and pitch information of synthesized sound 
output by the voice synthesizing means, and the control 
means may also control the speech speed or volume of
synthesized sound output by the voice synthesizing 
means. 

[0009] The voice processing means may extract the 
pitch information or phonemics information of the input 
voice, and in this case, the emotion state of the robot
may be changed based on the pitch information or
phonemics information, or the robot may take actions
corresponding to the pitch information or phonemics
information.

[0010] The voice processing means may comprise
voice recognizing means for recognizing input voice,
and the robot may take actions corresponding to the
reliability of the voice recognition results output from the
voice recognizing means, or the emotion state of the
robot may be changed based on the reliability.
[0011] The control means may recognize the action 
which the robot is taking, and control voice processing 
by the voice processing means based on the load
regarding that action. Also, the robot may take actions
corresponding to resources which can be appropriated to
voice processing by the voice processing means. 
[0012] The voice processing method according to the 
present invention comprises: a voice processing step
for processing voice; and a control step for controlling 
voice processing in the voice processing step, based on 
the state of the robot. 

[0013] The recording medium according to the 
present invention records programs comprising: a
voice processing step for processing voice; and a
control step for controlling voice processing in the voice
processing step, based on the state of the robot. 
[0014] With the voice processing device, voice 
processing method, and recording medium according to 
the present invention, voice processing is controlled 
based on the state of the robot. 

[0015] The invention will now be described by way of 
example with reference to the accompanying drawings, 
throughout which like parts are referred to by like refer- 
ences, and in which: 

Fig. 1 is a perspective view illustrating an external
configuration example of an embodiment of a robot
to which the present invention has been applied;
Fig. 2 is a block diagram illustrating an internal
configuration example of the robot shown in Fig. 1;
Fig. 3 is a block diagram illustrating a functional
configuration example of the controller 10 shown in Fig. 2;
Fig. 4 is a diagram illustrating an emotion/instinct
model;
Figs. 5A and 5B are diagrams describing the
processing in the emotion/instinct model unit 51;
Fig. 6 is a diagram illustrating an action model;
Fig. 7 is a diagram for describing the processing of
the attitude transition mechanism unit 53;
Fig. 8 is a block diagram illustrating a configuration
example of the voice recognizing unit 50A;
Fig. 9 is a flowchart describing the processing of the
voice recognizing unit 50A;
Fig. 10 is also a flowchart describing the processing
of the voice recognizing unit 50A;
Fig. 11 is a block diagram illustrating a configuration
example of the voice synthesizing unit 55;
Fig. 12 is a flowchart describing the processing of
the voice synthesizing unit 55;
Fig. 13 is also a flowchart describing the processing
of the voice synthesizing unit 55;
Fig. 14 is a block diagram illustrating a configuration
example of the image recognizing unit 50B;
Fig. 15 is a diagram illustrating the relationship
between the load regarding priority processing, and
the CPU power which can be appropriated to voice
recognizing processing; and
Fig. 16 is a flowchart describing the processing of
the action determining mechanism unit 52.
[0016] Fig. 1 illustrates an external configuration
example of an embodiment of a robot to which the present
invention has been applied, and Fig. 2 illustrates an 
electrical configuration example thereof. 
[0017] With the present embodiment, the robot is a 
dog-type robot, with leg units 3A, 3B, 3C, and 3D linked 
to a torso unit 2, at the front and rear right and left por- 
tions, and with a head unit 4 and tail unit 5 respectively 
linked to the front portion and rear portion of the torso 
unit 2. 

[0018] The tail unit 5 extends from a base portion
5B provided on the upper surface of the torso unit 2 so as
to be capable of bending or rocking with a certain degree 
of freedom. 

[0019] Stored in the torso unit 2 are a controller 10 
which performs control of the entire robot, a battery 11 
which is the power source for the robot, an internal sen- 
sor unit 14 made up of a battery sensor 12 and thermal 
sensor 13, and so forth. 

[0020] Positioned in the head unit 4 are a microphone 
15 which serves as an "ear", a CCD (Charge Coupled 
Device) camera 16 which serves as an "eye", a touch 
sensor 17 which serves as the sense of touch, a speaker
18 serving as the "mouth", etc., at the respective posi- 
tions. 

[0021] Further, provided to the joint portions of the leg
units 3A through 3D, the linkage portions of the leg units
3A through 3D to the torso unit 2, the linkage portion of
the head unit 4 to the torso unit 2, the linkage portions
of the tail unit 5 to the torso unit 2, etc., are actuators
3AA_1 through 3AA_K, 3BA_1 through 3BA_K, 3CA_1 through
3CA_K, 3DA_1 through 3DA_K, 4A_1 through 4A_L, 5A_1, and
5A_2, as shown in Fig. 2.

[0022] The microphone 15 in the head unit 4 collects
surrounding voice (sounds) including speech of the
user, and sends the obtained voice signals to the controller
10. The CCD camera 16 takes images of the surrounding
conditions, and sends the obtained image signals to
the controller 10.

[0023] The touch sensor 17 is provided at the upper
portion of the head unit 4 for example, so as to detect
pressure received by physical actions from the user
such as "petting" or "hitting", and sends the detection
results as pressure detection signals to the controller 10.
[0024] The battery sensor 12 in the torso unit 2
detects the remaining amount of the battery 11, and sends
the detection results as remaining battery amount
detection signals to the controller 10. The thermal sensor
13 detects heat within the robot, and sends the detection
results as thermal detection signals to the controller 10.
[0025] The controller 10 has a CPU (Central Processing
Unit) 10A, memory 10B, and the like built in, and
performs various types of processing by the CPU 10A
executing control programs stored in the memory 10B.

[0026] That is, the controller 10 judges surrounding 
conditions, commands from the user, actions performed 
upon the robot by the user, etc., or the absence thereof,
based on voice signals, image signals, pressure detection
signals, remaining battery amount detection signals,
and thermal detection signals, from the microphone
15, CCD camera 16, touch sensor 17, battery
sensor 12, and thermal sensor 13.

[0027] Further, based on the judgement results and
the like, the controller 10 decides subsequent actions,
and drives the necessary actuators from among the
actuators 3AA_1 through 3AA_K, 3BA_1 through 3BA_K, 3CA_1
through 3CA_K, 3DA_1 through 3DA_K, 4A_1 through 4A_L,
5A_1, and 5A_2, based on the decision results, thereby
causing the robot to perform actions such as moving the
head unit 4 vertically or horizontally, moving the tail unit
5, driving the leg units 3A through 3D so as to cause the
robot to take actions such as walking, and so forth.
[0028] Also, if necessary, the controller 10 generates
synthesized sound which is supplied to the speaker 18
and output, or causes unshown LEDs (Light-Emitting Diodes)
provided at the position of the "eyes" of the robot to go
on, off, or blink.

[0029] Thus, the robot is arranged so as to act in an 
autonomous manner, based on surrounding conditions
and the like. 

[0030] Next, Fig. 3 illustrates a functional configuration
example of the controller 10 shown in Fig. 2. The
functional configuration shown in Fig. 3 is realized by the
CPU 10A executing the control programs stored in the 
memory 10B. 

[0031] The controller 10 comprises a sensor input
processing unit 50 which recognizes specific external
states, an emotion/instinct model unit 51 which accumulates
the recognition results of the sensor input processing
unit 50 and expresses the state of emotions and
instincts, an action determining mechanism unit 52 which
determines subsequent action based on the recognition
results of the sensor input processing unit 50 and the
like, an attitude transition mechanism unit 53 which
causes the robot to actually take actions based on the
determination results of the action determining mechanism
unit 52, a control mechanism unit 54 which drives
and controls the actuators 3AA_1 through 5A_1 and 5A_2,
and a voice synthesizing unit 55 which generates
synthesized sound.

[0032] The sensor input processing unit 50 recognizes
certain external states, actions performed on the robot
by the user, instructions and the like from the user, etc.,
based on the voice signals, image signals, pressure
detection signals, etc., provided from the microphone 15,
CCD camera 16, touch sensor 17, etc., and notifies the
emotion/instinct model unit 51 and action determining
mechanism unit 52 of state recognition information
representing the recognition results.
[0033] That is, the sensor input processing unit 50 has
a voice recognizing unit 50A, and the voice recognizing unit
50A performs voice recognition following the control of
the action determining mechanism unit 52 using the
voice signals provided from the microphone 15, taking
into consideration the information obtained from the
emotion/instinct model unit 51 and action determining
mechanism unit 52 as necessary. Then, the voice
recognizing unit 50A notifies the emotion/instinct model unit
51 and action determining mechanism unit 52 of
instructions and the like of the voice recognition results, such
as "walk", "down", "chase the ball", for example, as state
recognition information.

[0034] Also, the sensor input processing unit 50 has
an image recognizing unit 50B, and the image recognizing
unit 50B performs image recognition processing using
image signals provided from the CCD camera 16. In
the event that as a result of the processing the image
recognizing unit 50B detects "a red round object" or "a
plane vertical to the ground having a certain height or
more", for example, image recognition results such as
"there is a ball" or "there is a wall" are notified to the
emotion/instinct model unit 51 and action determining
mechanism unit 52, as state recognition information.
[0035] Further, the sensor input processing unit 50
has a pressure processing unit 50C, and the pressure
processing unit 50C processes pressure detection signals
provided from the touch sensor 17. Then, in the
event that the pressure processing unit 50C detects, as
the result of the processing, pressure of a certain threshold
value or greater within a short time, the pressure
processing unit 50C recognizes that the robot has been
"struck (scolded)", while in the event that the pressure
processing unit 50C detects pressure less than the
threshold value over a long time, the pressure processing
unit 50C recognizes that the robot has been "petted
(praised)". The recognition results thereof are notified
to the emotion/instinct model unit 51 and action determining
mechanism unit 52, as state recognition
information.
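
The threshold rule of the pressure processing unit 50C might be sketched as follows. This is only an illustration: the specification gives no numerical values, so `PRESSURE_THRESHOLD`, `SHORT_TIME_S`, the normalized pressure scale, and the function name `classify_touch` are all assumptions, as is treating any contact longer than `SHORT_TIME_S` as "a long time".

```python
from typing import Optional

# Hypothetical values; the specification does not state any numbers.
PRESSURE_THRESHOLD = 0.5  # normalized pressure
SHORT_TIME_S = 0.3        # contacts up to this length count as "short"


def classify_touch(pressure: float, duration_s: float) -> Optional[str]:
    """Map one touch event to state recognition information, or None."""
    if pressure >= PRESSURE_THRESHOLD and duration_s <= SHORT_TIME_S:
        return "struck"  # strong, brief contact: scolded
    if pressure < PRESSURE_THRESHOLD and duration_s > SHORT_TIME_S:
        return "petted"  # gentle, sustained contact: praised
    return None          # neither pattern described in the text applies
```

Combinations the text does not describe (for example strong pressure over a long time) are left unclassified here rather than guessed.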

[0036] The emotion/instinct model unit 51 manages 
both an emotion model and instinct model, representing 
the state of emotions and instincts of the robot, as shown 
in Fig. 4. Here, the emotion model and instinct model 
are stored in the memory 10B shown in Fig. 3. 
[0037] The emotion model is made up of three emotion
units 60A, 60B, and 60C, for example, and the emotion
units 60A through 60C each represent the state
(degree) of "happiness", "sadness", and "anger", with a
value within the range of 0 to 100, for example. The values
are each changed based on state recognition informa- 
tion from the sensor input processing unit 50, passage 
of time, and so forth. 

[0038] Incidentally, an emotion unit corresponding to 
"fun" can be provided in addition to "happiness", "sad- 
ness", and "anger". 

[0039] The instinct model is made up of three instinct
units 61A, 61B, and 61C, for example, and the instinct
units 61A through 61C each represent the state
(degree) of "hunger", "desire to sleep", and "desire to
exercise", which are instinctive desires, with a value within the
range of 0 to 100, for example. The values are each
changed based on state recognition information from
the sensor input processing unit 50, passage of time,
and so forth.

[0040] The emotion/instinct model unit 51 outputs the
state of emotion represented by the values of the emotion
units 60A through 60C and the state of instinct
represented by the values of the instinct units 61A through
61C, which change as described above, as emotion/instinct
state information to the sensor input processing unit
50, action determining mechanism unit 52, and voice
synthesizing unit 55.

[0041] Now, at the emotion/instinct model unit 51, the
emotion units 60A through 60C making up the emotion
model are linked in a mutually suppressing or mutually
stimulating manner, such that in the event that the value
of one of the emotion units changes, the values of the
other emotion units change accordingly, thus realizing
natural emotion change.

[0042] That is, for example, as shown in Fig. 5A, in
the emotion model the emotion unit 60A representing
"happiness" and the emotion unit 60B representing
"sadness" are linked in a mutually suppressive manner,
such that in the event that the robot is praised by the
user, the value of the emotion unit 60A for "happiness"
first increases. Further, in this case, the value of the
emotion unit 60B for "sadness" decreases in a manner
corresponding with the increase of the value of the emotion
unit 60A for "happiness", even though state recognition
information for changing the value of the emotion
unit 60B for "sadness" has not been supplied to the
emotion/instinct model unit 51. Conversely, in the event that
the value of the emotion unit 60B for "sadness" increases,
the value of the emotion unit 60A for "happiness"
decreases accordingly.

[0043] Further, the emotion unit 60B representing
"sadness" and the emotion unit 60C representing "anger"
are linked in a mutually stimulating manner, such
that in the event that the robot is struck by the user, the
value of the emotion unit 60C for "anger" first increases.
Further, in this case, the value of the emotion unit 60B
for "sadness" increases in a manner corresponding with
the increase of the value of the emotion unit 60C for "anger",
even though state recognition information for
changing the value of the emotion unit 60B for "sadness"
has not been supplied to the emotion/instinct model unit
51. Conversely, in the event that the value of the emotion
unit 60B for "sadness" increases, the value of the emotion
unit 60C for "anger" increases accordingly.
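
The mutually suppressing and mutually stimulating links between the emotion units can be illustrated with a small sketch. The coupling coefficients, the initial values of 50, the single-step propagation, and the names `COUPLING` and `EmotionModel` are assumptions made for illustration; the specification only states the direction of each link and the 0 to 100 range.

```python
def clamp(value: float, lo: float = 0.0, hi: float = 100.0) -> float:
    """Keep an emotion value within the 0 to 100 range."""
    return max(lo, min(hi, value))


# Hypothetical coupling coefficients: negative = mutually suppressing,
# positive = mutually stimulating.
COUPLING = {
    ("happiness", "sadness"): -0.5,
    ("sadness", "happiness"): -0.5,
    ("anger", "sadness"): 0.5,
    ("sadness", "anger"): 0.5,
}


class EmotionModel:
    def __init__(self) -> None:
        self.values = {"happiness": 50.0, "sadness": 50.0, "anger": 50.0}

    def change(self, unit: str, delta: float) -> None:
        """Change one unit, then propagate once to the units linked to it."""
        old = self.values[unit]
        self.values[unit] = clamp(old + delta)
        applied = self.values[unit] - old  # change actually applied after clamping
        for (src, dst), k in COUPLING.items():
            if src == unit:
                self.values[dst] = clamp(self.values[dst] + k * applied)
```

With these assumed coefficients, praising the robot (`change("happiness", 20)`) lowers "sadness" without any state recognition information for "sadness" being supplied, and striking it (`change("anger", 10)`) raises "sadness" along with "anger".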
[0044] Further, at the emotion/instinct model unit 51,
the instinct units 61A through 61C making up the instinct
model are also linked in a mutually suppressing or
mutually stimulating manner, as with the above emotion
model, such that in the event that the value of one of the
instinct units changes, the values of the other instinct
units change accordingly, thus realizing natural instinct
change.

[0045] Also, in addition to state recognition information
being supplied to the emotion/instinct model unit 51
from the sensor input processing unit 50, action information
indicating current or past actions of the robot,
i.e., representing the contents of actions, such as "walked
for a long time" for example, is supplied from the action
determining mechanism unit 52, so that even in the
event that the same state recognition information is
provided, different emotion/instinct state information is
generated according to the actions of the robot indicated by
the action information.

[0046] That is to say, as shown in Fig. 5B for example, 
with regard to the emotion model, intensity increasing/ 
decreasing functions 65A through 65C for generating 
value information for increasing or decreasing the val- 
ues of the emotion units 60A through 60C based on the 
action information and the state recognition information 
are each provided at the stage preceding the emotion
units 60A through 60C. The values of the emotion units
60A through 60C are each increased or decreased
according to the value information output from the
intensity increasing/decreasing functions 65A through 65C.
[0047] As a result, in the event that the robot greets 
the user and the user pets the robot on the head, for 
example, the action information of greeting the user and 
the state recognition information of having been petted on
the head are provided to the intensity increasing/de- 
creasing function 65A, and in this case, the value of the 
emotion unit 60A for "happiness" is increased at the 
emotion/instinct model unit 51. 

[0048] On the other hand, in the event that the robot 
is petted on the head while executing a task of some 
sort, action information that a task is being executed and 
the state recognition information of having been petted on
the head are provided to the intensity increasing/decreasing
function 65A, but in this case, the value of the
emotion unit 60A for "happiness" is not changed at the
emotion/instinct model unit 51.

[0049] Thus, the emotion/instinct model unit 51 does 
not only make reference to the state recognition infor- 
mation, but also makes reference to action information 
indicating the past or present actions of the robot, and 
thus sets the values of the emotion units 60A through 
60C. Consequently, in the event that the user mischievously
pets the robot on the head while the robot is
executing a task of some sort, unnatural changes in
emotions due to the value of the emotion unit 60A for
"happiness" being increased can be avoided.
[0050] Further, regarding the instinct units 61A
through 61C making up the instinct model, the emotion/
instinct model unit 51 increases or decreases the values 
of each based on both state recognition information and 
action information in the same manner as with the case 
of the emotion model. 

[0051] Now, the intensity increasing/decreasing functions
65A through 65C are functions which generate and
output value information for changing the values of the
emotion units 60A through 60C according to preset
parameters, with the state recognition information and
action information as input thereof, and setting these
parameters to different values for each robot would allow
for individual characteristics for each robot, such as one
robot being of a testy nature and another being jolly, for
example.

[0052] Returning to Fig. 3, the action determining
mechanism unit 52 decides the next action based on
state recognition information from the sensor input
processing unit 50 and emotion/instinct information from
the emotion/instinct model unit 51, passage of time, etc.,
and the decided action contents are output to the attitude
transition mechanism unit 53 as action instruction
information.

[0053] That is, as shown in Fig. 6, the action determining
mechanism unit 52 manages finite automatons
in which the actions which the robot is capable of taking
correspond to states, as action models stipulating
the actions of the robot. The state in the finite
automaton serving as the action model is caused to
make transition based on state recognition information
from the sensor input processing unit 50, the values of
the emotion model and instinct model at the emotion/
instinct model unit 51, passage of time, etc., and actions
corresponding to the state following the transition are
determined to be the actions to be taken next.
[0054] Specifically, for example, in Fig. 6, let us say
that state ST3 represents an action of "standing", state
ST4 represents an action of "lying on side", and state
ST5 represents an action of "chasing a ball". Now, in the
state ST5 for "chasing a ball" for example, in the event
that state recognition information of "visual contact with
ball has been lost" is supplied, the state makes a transition
from state ST5 to state ST3, and consequently,
the action of "standing" which corresponds to state ST3
is decided upon as the subsequent action. Also, in the
event that the robot is in state ST4 for "lying on side" for
example, and state recognition information of "Get up!"
is supplied, the state makes a transition from state ST4
to state ST3, and consequently, the action of "standing"
which corresponds to state ST3 is decided upon as the
subsequent action.
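
The two transitions just described can be written as a small table-driven finite automaton. This is a minimal sketch: the dictionary layout and the function name `next_state` are assumptions, only the states and transitions named in the text are included, and inputs with no listed transition are assumed to leave the state unchanged.

```python
# Actions corresponding to the states named in the text.
ACTIONS = {
    "ST3": "standing",
    "ST4": "lying on side",
    "ST5": "chasing a ball",
}

# (current state, state recognition information) -> next state.
TRANSITIONS = {
    ("ST5", "visual contact with ball has been lost"): "ST3",
    ("ST4", "Get up!"): "ST3",
}


def next_state(state: str, state_info: str) -> str:
    """Return the state after the transition, or the same state if none applies."""
    return TRANSITIONS.get((state, state_info), state)
```

For instance, `ACTIONS[next_state("ST4", "Get up!")]` evaluates to `"standing"`, the subsequent action decided upon in the second example.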

[0055] Now, in the event that the action determining
mechanism unit 52 detects a predetermined trigger,
state transition is executed. That is to say, in the event
that the time for the action corresponding to the current
state has reached a predetermined time, in the event
that certain state recognition information has been
received, in the event that the value of the state of emotion
(i.e., values of emotion units 60A through 60C) or the
value of the state of instinct (i.e., values of instinct units
61A through 61C) represented by the emotion/instinct
state information supplied from the emotion/instinct
model unit 51 is equal to or less than, or equal to
or greater than, a predetermined threshold value, etc.,
the action determining mechanism unit 52 causes state
transition.

[0056] Note that the action determining mechanism
unit 52 causes state transition of the finite automaton in
Fig. 6 based not only on state recognition information from
the sensor input processing unit 50, but also on
values of the emotion model and instinct model from the
emotion/instinct model unit 51, etc., so that even in the
event that the same state recognition information is
input, the destination of transition of the state differs
according to the emotion model and instinct model (i.e.,
emotion/instinct information).

[0057] Consequently, in the event that the emotion/ 
instinct state information indicates that the state is "not 
angry" and "not hungry", for example, and in the event 
that the state recognition information indicates "the palm 
of a hand being held out in front", the action determining 
mechanism unit 52 generates action instruction infor- 
mation for causing an action of "shaking hands" in ac- 
cordance with the hand being held out in front, and this 
is sent to the attitude transition mechanism unit 53. 
[0058] Also, in the event that the emotion/instinct 
state information indicates that the state is "not angry" 
and "hungry", for example, and in the event that the state
recognition information indicates "the palm of a hand be- 
ing held out in front", the action determining mechanism 
unit 52 generates action instruction information for caus- 
ing an action of "licking the hand" in accordance with the 
hand being held out in front, and this is sent to the atti- 
tude transition mechanism unit 53. 
[0059] Further, in the event that the emotion/instinct
state information indicates that the state is "angry" for 
example, and in the event that the state recognition in- 
formation indicates "the palm of a hand being held out 
in front", the action determining mechanism unit 52 gen- 
erates action instruction information for causing an ac- 
tion of "looking the other way", regardless of whether 
the emotion/instinct information indicates "hungry" or 
"not hungry", and this is sent to the attitude transition 
mechanism unit 53. 
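
The three cases above, in which the same state recognition information leads to different actions depending on the emotion/instinct state, can be summarized in a short sketch. Reducing the state to two booleans and the function name `decide_action` are simplifying assumptions; the specification represents the states as values of the emotion and instinct units.

```python
from typing import Optional

HAND_HELD_OUT = "the palm of a hand being held out in front"


def decide_action(angry: bool, hungry: bool, state_info: str) -> Optional[str]:
    """Choose the action for a hand held out, given the emotion/instinct state."""
    if state_info != HAND_HELD_OUT:
        return None  # other inputs are outside the scope of this sketch
    if angry:
        # When angry, the result is the same whether hungry or not.
        return "looking the other way"
    return "licking the hand" if hungry else "shaking hands"
```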

[0060] Incidentally, the action determining mecha- 
nism unit 52 is capable of determining the speed of walk- 
ing, the magnitude of movement of the legs and the 
speed thereof, etc., serving as parameters of action cor- 
responding to the state to which transition has been 
made, based on the state of emotions and instincts
indicated by the emotion/instinct state information
supplied from the emotion/instinct model unit 51.
[0061] Also, in addition to action instruction information
for causing movement of the robot head, legs, etc.,
the action determining mechanism unit 52 generates action
instruction information for causing speech by the robot,
and action instruction information for causing the
robot to execute speech recognition. The action instruction
information for causing speech by the robot is supplied
to the voice synthesizing unit 55, and the action
instruction information supplied to the voice synthesizing
unit 55 contains text and the like corresponding to
the synthesized sound to be generated by the voice
synthesizing unit 55. Once the voice synthesizing unit 55
receives the action instruction information from the action
determining mechanism unit 52, synthesized sound
is generated based on the text contained in the action
instruction information while taking into account the state of
emotions and the state of instincts managed by the emotion/
instinct model unit 51, and the synthesized sound is supplied
to and output from the speaker 18. Also, the action
instruction information for causing the robot to execute
speech recognition is supplied to the voice recognizing
unit 50A of the sensor input processing unit 50, and upon
receiving such action instruction information, the
voice recognizing unit 50A performs voice recognizing
processing.
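
As a rough illustration of synthesis that reflects the state of emotion, the sketch below picks between the two utterances given in the abstract ("What is it?" and "Yeah, what?") and attaches a speech speed and volume. The threshold on the "anger" value, the speed and volume figures, and the names `SynthesisRequest` and `plan_utterance` are all hypothetical; the actual processing of the voice synthesizing unit 55 is described with reference to Figs. 11 through 13.

```python
from dataclasses import dataclass

ANGER_THRESHOLD = 50  # hypothetical cut-off on the 0 to 100 "anger" value


@dataclass
class SynthesisRequest:
    text: str      # text to be synthesized
    speed: float   # relative speech speed (1.0 = normal), assumed scale
    volume: float  # relative volume from 0.0 to 1.0, assumed scale


def plan_utterance(anger: int) -> SynthesisRequest:
    """Choose wording and delivery according to the robot's anger value."""
    if anger >= ANGER_THRESHOLD:
        # "angry": curt wording, faster and louder (assumed values)
        return SynthesisRequest("Yeah, what?", speed=1.2, volume=1.0)
    # "not angry": neutral wording and delivery (assumed values)
    return SynthesisRequest("What is it?", speed=1.0, volume=0.7)
```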

[0062] Further, the action determining mechanism
unit 52 is arranged so as to supply the same action
information supplied to the emotion/instinct model unit 51
to the sensor input processing unit 50 and the voice
synthesizing unit 55. The voice recognizing unit 50A of the
sensor input processing unit 50 and the voice synthesizing
unit 55 respectively perform voice recognizing and voice
synthesizing, taking into account the action information from the
action determining mechanism unit 52. This point will be
described later.

[0063] The attitude transition mechanism unit 53
generates attitude transition information for causing
transition of the attitude of the robot from the current attitude
to the next attitude, based on the action instruction in- 
formation from the action determining mechanism unit 
52, and outputs this to the control mechanism unit 54. 
[0064] Now, a next attitude to which transition can be
made from the current attitude is determined by, e.g.,
the physical form of the robot such as the form, weight,
and linkage state of the torso and legs, and
the mechanism of the actuators 3AA_1 through 5A_1 and
5A_2 such as the direction and angle in which the joints
will bend, and so forth.

[0065] Also, regarding the next attitude, there are
attitudes to which transition can be made directly from the
current attitude, and attitudes to which transition cannot
be directly made from the current attitude. For example,
a quadruped robot in a state lying on its side with its legs
straight out can directly make transition to a state of lying
prostrate, but cannot directly make transition to a state
of standing, so there is the need to first draw the legs
near to the body and change to a state of lying prostrate,
following which the robot stands up, i.e., actions in two
stages are necessary. Also, there are attitudes to which
transition cannot be made safely. For example, in the
event that a quadruped robot in an attitude of standing
on four legs attempts to raise both front legs, the robot
will readily fall over.

[0066] Accordingly, the attitude transition mechanism
unit 53 registers beforehand attitudes to which direct
transition can be made, and in the event that the action
instruction information supplied from the action determining
mechanism unit 52 indicates an attitude to which
direct transition can be made, the action instruction
information is output without change as attitude transition
information to the control mechanism unit 54. On the
other hand, in the event that the action instruction
information indicates an attitude to which direct transition
cannot be made, the attitude transition mechanism unit
53 generates attitude transition information which first
causes transition to another attitude to which direct
transition can be made, and then causes transition
to the object attitude, and this information is sent to the
control mechanism unit 54. Thus, incidents
wherein the robot attempts to assume attitudes to which
transition is impossible, and incidents wherein the robot
falls over, can be prevented.

[0067] That is to say, as shown in Fig. 7 for example, 
the attitude transition mechanism unit 53 stores an ori- 
ented graph wherein the attitudes which the robot can 
assume are represented as nodes NODE 1 through 
NODE 5, and nodes corresponding to two attitudes be- 
tween which transition can be made are linked by ori- 
ented arcs ARC 1 through ARC 10, thereby generating 
attitude transition information such as described above, 
based on this oriented graph. 

[0068] Specifically, in the event that action instruction 
information is supplied from the action determining 
mechanism unit 52, the attitude transition mechanism 
unit 53 searches a path from the current node to the next 
node by following the direction of the oriented arc
connecting the node corresponding to the current attitude
and the node corresponding to the next attitude to be 
assumed which the action instruction information indi- 
cates, thereby generating attitude transition information 
wherein attitudes corresponding to the nodes on the 
searched path are assumed. 

[0069] Consequently, in the event that the current at- 
titude is the node NODE 2 which indicates the attitude 
of "lying prostrate" for example, and action instruction 
information of "sit" is supplied, the attitude transition 
mechanism unit 53 generates attitude transition infor- 
mation corresponding to "sit", since direct transition can 
be made from the NODE 2 which indicates the attitude 
of "lying prostrate" to the node NODE 5 which indicates 
the attitude of "sitting" in the oriented graph, and this 
information is provided to the control mechanism unit 
54. 

[0070] Also, in the event that the current attitude is the 
node NODE 2 which indicates the attitude of "lying pros- 
trate", and action instruction information of "walk" is sup- 
plied, the attitude transition mechanism unit 53 search- 
es a path from the NODE 2 which indicates the attitude 
of "lying prostrate" to the node NODE 4 which indicates 
the attitude of "walking", in the oriented graph. In this 
case, the path obtained is NODE 2 which indicates the
attitude of "lying prostrate", NODE 3 which indicates the
attitude of "standing", and NODE 4 which indicates the
attitude of "walking", so the attitude transition mechanism
unit 53 generates attitude transition information in
the order of "standing" and then "walking", which is sent to
the control mechanism unit 54.
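
The path search over the oriented graph described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the search algorithm, so breadth-first search is used here as one possible choice, and the attitude names used as node keys and most of the arcs are hypothetical. Only the arcs from "lying prostrate" to "sitting", from "lying prostrate" to "standing", and from "standing" to "walking" are taken from the description.

```python
from collections import deque

# Hypothetical oriented graph. Node names and most arcs are assumptions;
# only lying prostrate -> sitting, lying prostrate -> standing and
# standing -> walking come from the description.
ATTITUDE_GRAPH = {
    "lying on side": ["lying prostrate"],
    "lying prostrate": ["sitting", "standing"],
    "sitting": ["lying prostrate", "standing"],
    "standing": ["lying prostrate", "sitting", "walking"],
    "walking": ["standing"],
}


def find_attitude_path(graph, current, target):
    """Breadth-first search along arc directions; returns the node list or None."""
    queue = deque([[current]])
    visited = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


def attitude_transition_info(graph, current, target):
    """Attitudes to assume in order, excluding the current attitude."""
    path = find_attitude_path(graph, current, target)
    if path is None:
        raise ValueError(f"no transition from {current!r} to {target!r}")
    return path[1:]


# "walk" requested while lying prostrate: stand first, then walk.
print(attitude_transition_info(ATTITUDE_GRAPH, "lying prostrate", "walking"))
```

With this graph, a request that is directly reachable (such as "sitting" from "lying prostrate") yields a single-step result, while an indirect request yields the intermediate attitudes in order.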

[0071] The control mechanism unit 54 generates control
signals for driving the actuators 3AA 1 through 5A 1 and 5A 2
according to the attitude transition information from the
attitude transition mechanism unit 53, and sends these
signals to the actuators 3AA 1 through 5A 1 and 5A 2. Thus,
the actuators 3AA 1 through 5A 1 and 5A 2 are driven according
to the control signals, and the robot acts in an autonomous
manner.
[0072] Next, Fig. 8 illustrates a configuration example 
of the voice recognizing unit 50A shown in Fig. 3. 

[0073] Audio signals from the microphone 15 are supplied to
an A/D (Analog/Digital) converting unit 21. At the A/D
converting unit 21, the analog voice signals from the
microphone 15 are sampled and quantized, and subjected to
A/D conversion into digital voice signal data. This voice
data is supplied to a characteristics extracting unit 22.

[0074] The characteristics extracting unit 22 performs MFCC
(Mel Frequency Cepstrum Coefficient) analysis, for example,
for each appropriate frame of the input voice data, and
outputs the analysis results to the matching unit 23 as
characteristics parameters (characteristics vectors).
Incidentally, the characteristics extracting unit 22 can
extract other characteristics parameters instead, such as
linear prediction coefficients, cepstrum coefficients, line
spectrum sets, power for predetermined frequency bands
(filter bank output), etc.

[0075] Also, the characteristics extracting unit 22 extracts
pitch information from the voice data input thereto. That is,
the characteristics extracting unit 22 performs
autocorrelation analysis of the voice data, for example,
thereby extracting pitch information, i.e., information
relating to the pitch frequency, power (amplitude),
intonation, etc., of the voice input to the microphone 15.
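
As a rough illustration of the autocorrelation analysis mentioned above, the following sketch estimates the pitch frequency and power of a single frame. The function names, the 50 Hz to 400 Hz search range, and the 8 kHz sampling rate are assumptions made for this example; the text does not specify them, and a practical extractor would also need a voiced/unvoiced decision.

```python
import math


def autocorrelation(frame, lag):
    """Unnormalized autocorrelation of one frame at the given lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def estimate_pitch(frame, sample_rate, min_hz=50.0, max_hz=400.0):
    """Return (pitch frequency in Hz, mean power) for one frame.

    The search range min_hz..max_hz is an assumed parameter.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), len(frame) - 1)
    # The lag with the strongest autocorrelation is taken as the pitch period.
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(frame, lag))
    power = sum(x * x for x in frame) / len(frame)
    return sample_rate / best_lag, power


# Synthetic test signal: a 200 Hz tone sampled at 8 kHz (assumed values).
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(400)]
pitch_hz, power = estimate_pitch(frame, SAMPLE_RATE)
```

For the synthetic tone, the strongest autocorrelation peak falls at a lag of 40 samples, which corresponds to 200 Hz at the assumed sampling rate.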

[0076] The matching unit 23 performs voice recognition of the
voice input to the microphone 15 (i.e., the input voice)
using the characteristics parameters from the characteristics
extracting unit 22, based on continuous distribution HMM
(Hidden Markov Model) for example, while making reference to
the acoustics model storing unit 24, dictionary storing unit
25, and grammar storing unit 26, as necessary.

[0077] That is to say, the acoustics model storing unit 24
stores acoustics models representing acoustical
characteristics such as individual phonemes and syllables in
the language of the voice which is to be subjected to voice
recognition. Here, voice recognition is performed based on
the continuous distribution HMM method, so the HMM (Hidden
Markov Model) is used as the acoustics model. The dictionary
storing unit 25 stores word dictionaries describing
information relating to the pronunciation (i.e., phonemics
information) for each word to be recognized. The grammar
storing unit 26 stores syntaxes describing the manner in
which the words registered in the word dictionaries of the
dictionary storing unit 25 concatenate (connect). The syntax
used here may be rules based on context-free grammar (CFG),
stochastic word concatenation probability (N-gram), and so
forth.

[0078] The matching unit 23 connects the acoustics models
stored in the acoustics model storing unit 24 by making
reference to the word dictionaries stored in the dictionary
storing unit 25, thereby configuring word acoustics models
(word models). Further, the matching unit 23 connects
multiple word models by making reference to the syntaxes
stored in the grammar storing unit 26, and recognizes the
speech input from the microphone 15 using the word models
thus connected, based on the characteristics parameters, by
continuous distribution HMM. That is to say, the matching
unit 23 detects the word model sequence with the highest
score (likelihood) of observation of the time-sequence
characteristics parameters output by the characteristics
extracting unit 22, and the phonemics information (reading)
of the word string corresponding to the word model sequence
is output as the voice recognition results.

[0079] More specifically, the matching unit 23 accumulates
the emergence probability of each of the characteristics
parameters regarding word strings corresponding to the
connected word models, and with the accumulated value as the
score thereof, outputs the phonemics information of the word
string with the highest score as the voice recognition
results.

[0080] Further, the matching unit 23 outputs the score of the
voice recognizing results as reliability information
representing the reliability of the voice recognizing
results.
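
The score accumulation and the use of the score as reliability information can be pictured with the following simplified sketch. It is not an HMM decoder: it assumes that the per-frame emergence probabilities for each candidate word string are already available, and it accumulates them in the log domain, which is an assumption made here for numerical convenience. The candidate strings and probability values are hypothetical.

```python
import math


def score_word_string(emergence_probs):
    """Accumulate per-frame emergence probabilities in the log domain."""
    return sum(math.log(p) for p in emergence_probs)


def recognize(candidates):
    """candidates: dict mapping a word string to its per-frame probabilities.

    Returns (best word string, its score). The score also serves as the
    reliability information in this sketch.
    """
    scored = {words: score_word_string(probs)
              for words, probs in candidates.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical probabilities, for illustration only.
candidates = {
    "shake hands": [0.9, 0.8, 0.85],
    "sit": [0.4, 0.3, 0.5],
}
result, reliability = recognize(candidates)
```

A score closer to zero in the log domain indicates a better match, so a downstream unit could compare it against a cut-off to judge whether the result is reliable.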

[0081] Also, the matching unit 23 detects the duration of each
phoneme and word making up the voice recognizing results,
which is obtained along with the score calculation described
above, and outputs this as phonemics information of the voice
input to the microphone 15.

[0082] The recognition results of the voice input to the 
microphone 15, the phonemics information, and reliabil- 
ity information, output as described above, are output to 
the emotion/instinct model unit 51 and action determin- 
ing mechanism unit 52, as state recognition information. 
[0083] The voice recognizing unit 50A configured as described
above is subjected to control of voice recognition processing
based on the state of emotions and instincts of the robot,
managed by the emotion/instinct model unit 51. That is, the
state of emotions and instincts of the robot managed by the
emotion/instinct model unit 51 is supplied to the
characteristics extracting unit 22 and the matching unit 23,
and the characteristics extracting unit 22 and the matching
unit 23 change the processing contents based on the state of
emotions and instincts of the robot supplied thereto.
[0084] Specifically, as shown in the flowchart in Fig. 9, once
action instruction information instructing voice recognition
processing is transmitted from the action determining
mechanism unit 52, the action instruction information is
received in step S1, and the blocks making up the voice
recognizing unit 50A are set to an active state. Thus, the
voice recognizing unit 50A is set in a state capable of
accepting the voice that has been input to the microphone 15.

[0085] Incidentally, the blocks making up the voice
recognizing unit 50A may be set to an active state at all
times. In this case, an arrangement may be made, for example,
wherein the processing from step S2 on in Fig. 9 is started
at the voice recognizing unit 50A each time the state of
emotions and instincts of the robot managed by the
emotion/instinct model unit 51 changes.
[0086] Subsequently, the characteristics extracting unit 22
and the matching unit 23 recognize the state of emotions and
instincts of the robot by making reference to the
emotion/instinct model unit 51 in step S2, and the flow
proceeds to step S3. In step S3, the matching unit 23 sets
the word dictionaries to be used for the above-described
score calculating (matching), based on the state of emotions
and instincts.

[0087] That is to say, here, the dictionary storing unit 25
divides the words which are to be the object of recognition
into several categories, and stores multiple word
dictionaries with words registered for each category. In step
S3, the word dictionaries to be used for voice recognizing
are set based on the state of emotions and instincts of the
robot.

[0088] Specifically, in the event that there is a word
dictionary with the word "shake hands" registered in the
dictionary storing unit 25 and also a word dictionary without
the word "shake hands" registered therein, and in the event
that the state of emotion of the robot represents "pleasant",
the word dictionary with the word "shake hands" registered
therein is used for voice recognizing. However, in the event
that the state of emotion of the robot represents "cross",
the word dictionary with the word "shake hands" not
registered therein is used for voice recognizing.
Accordingly, in the event that the state of emotion of the
robot is pleasant, the speech "shake hands" is recognized,
and the voice recognizing results thereof are supplied to the
action determining mechanism unit 52, thereby causing the
robot to take action corresponding to the speech "shake
hands" as described above. On the other hand, in the event
that the state of emotion of the robot is cross, the speech
"shake hands" is not recognized (or is erroneously
recognized), so the robot makes no response thereto (or takes
actions unrelated to the speech "shake hands").
[0089] Incidentally, the arrangement here is such that
multiple word dictionaries are prepared, and the word
dictionaries to be used for voice recognizing are selected
based on the state of emotions and instincts of the robot,
but other arrangements may be made, such as an arrangement
wherein just one word dictionary is provided and the words to
serve as the object of voice recognizing are selected from
that word dictionary, based on the state of emotions and
instincts of the robot.
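
The dictionary selection of step S3 can be sketched as below. The dictionary names, their contents other than "shake hands", and the rule that every state other than "cross" uses the fuller dictionary are assumptions made for this illustration.

```python
# Hypothetical word dictionaries; names and contents are assumptions.
WORD_DICTIONARIES = {
    "with_shake_hands": {"shake hands", "sit", "walk"},
    "without_shake_hands": {"sit", "walk"},
}


def select_dictionary(emotion_state):
    """Pick the word dictionary used for matching from the emotion state."""
    if emotion_state == "cross":
        return WORD_DICTIONARIES["without_shake_hands"]
    # Assumption: any other state uses the dictionary containing "shake hands".
    return WORD_DICTIONARIES["with_shake_hands"]


def recognize_word(speech, emotion_state):
    """Return the word if it is in the active dictionary, else None."""
    active = select_dictionary(emotion_state)
    return speech if speech in active else None


print(recognize_word("shake hands", "pleasant"))  # recognized
print(recognize_word("shake hands", "cross"))     # not recognized
```

The single-dictionary variant mentioned in the text would amount to filtering one word set by emotion state instead of switching between two sets.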
[0090] Following the processing of step S3, the flow proceeds
to step S4, and the characteristics extracting unit 22 and
the matching unit 23 set the parameters to be used for voice
recognizing processing (i.e., recognition parameters), based
on the state of emotions and instincts of the robot.

[0091] That is, for example, in the event that the emotion
state of the robot indicates "angry" or the instinct state of
the robot indicates "sleepy", the characteristics extracting
unit 22 and the matching unit 23 set the recognition
parameters such that the voice recognition precision
deteriorates. On the other hand, in the event that the
emotion state of the robot indicates "pleasant", the
characteristics extracting unit 22 and the matching unit 23
set the recognition parameters such that the voice
recognition precision improves.

[0092] Now, recognition parameters which affect the 
voice recognition precision include, for example, thresh- 
old values compared with the voice input to the micro- 
phone 15, used in detection of voice sections, and so 
forth. 
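
One way to picture a state-dependent threshold for voice section detection is the sketch below. It assumes that the threshold is compared with per-frame power and that a higher threshold is used for "angry" or "sleepy" so that quiet speech is missed; the threshold values and the frame powers are hypothetical.

```python
# Hypothetical thresholds on frame power; the values are assumptions.
SECTION_THRESHOLDS = {"pleasant": 0.01, "angry": 0.2, "sleepy": 0.2}
DEFAULT_THRESHOLD = 0.05


def detect_voice_sections(frame_powers, state):
    """Return (start, end) frame index pairs whose power exceeds the threshold.

    The end index is exclusive.
    """
    threshold = SECTION_THRESHOLDS.get(state, DEFAULT_THRESHOLD)
    sections, start = [], None
    for i, power in enumerate(frame_powers):
        if power > threshold and start is None:
            start = i
        elif power <= threshold and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(frame_powers)))
    return sections


powers = [0.0, 0.03, 0.1, 0.3, 0.25, 0.04, 0.0]
print(detect_voice_sections(powers, "pleasant"))
print(detect_voice_sections(powers, "angry"))
```

With the higher threshold, the detected section shrinks to the loudest frames, so the quieter onset and tail of the utterance would not reach the matching stage.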

[0093] Subsequently, the flow proceeds to step S5, wherein
the voice input to the microphone 15 is taken into the
characteristics extracting unit 22 via the A/D converting
unit 21, and the flow proceeds to step S6. At step S6, the
above-described processing is performed at the
characteristics extracting unit 22 and the matching unit 23
under the settings made in steps S3 and S4, thereby executing
voice recognizing of the voice input to the microphone 15.
Then, the flow proceeds to step S7, and the phonemics
information, pitch information, and reliability information,
which are the voice recognition results obtained by the
processing in step S6, are output to the emotion/instinct
model unit 51 and action determining mechanism unit 52 as
state recognition information, and the processing ends.

[0094] Upon receiving such state recognition informa- 
tion from the voice recognizing unit 50A, the emotion/ 
instinct model unit 51 changes the values of the emotion 
model and instinct model as described with Fig. 5 based 
on the state recognition information, thereby changing 
the state of emotions and the state of instincts of the 
robot. 

[0095] That is, for example, in the event that the pho- 
nemics information serving as the voice recognition re- 
sults in the state recognition information is, "Fool!", the 
emotion/instinct model unit 51 increases the value of the 
emotion unit 60C for "anger". Also, the emotion/instinct 
model unit 51 changes the value information output by
the increasing/decreasing functions 65A through 65C,
based on pitch frequency serving as the phonemics
information in the state recognition information, and the
power and duration thereof, thereby changing the values
of the emotion model and instinct model.
[0096] Also, in the event that the reliability information 
in the state recognition information indicates that the
reliability of the voice recognition results is low, the
emotion/instinct model unit 51 increases the value of the
emotion unit 60B for "sadness", for example. On the oth- 
er hand, in the event that the reliability information in the 
state recognition information indicates that the reliability 
of the voice recognition results is high, the emotion/in- 
stinct model unit 51 increases the value of the emotion 
unit 60A for "happiness", for example. 
[0097] Upon receiving the state recognition information from
the voice recognizing unit 50A, the action determining
mechanism unit 52 determines the next action of the robot
based on the state recognition information, and generates
action instruction information representing that action.

[0098] That is to say, the action determining mechanism unit
52 determines an action to take corresponding to the
phonemics information of the voice recognizing results in the
state recognition information as described above, for example
(e.g., determines to shake hands in the event that the voice
recognizing results are "shake hands").

[0099] Or, in the event that the reliability information in
the state recognition information indicates that the
reliability of the voice recognizing results is low, the
action determining mechanism unit 52 determines to take an
action such as cocking the head or acting apologetically, for
example. On the other hand, in the event that the reliability
information in the state recognition information indicates
that the reliability of the voice recognizing results is
high, the action determining mechanism unit 52 determines to
take an action such as nodding the head, for example. In this
case, the robot can indicate to the user the degree to which
it has understood the speech of the user.
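
The reliability-dependent choice of action can be sketched as a simple comparison against two cut-off values. The text only distinguishes "low" and "high" reliability, so the numeric cut-offs, the log-score scale, and the handling of intermediate values are assumptions made for this example.

```python
# Hypothetical cut-off values on a log-score scale; the text only
# distinguishes "low" and "high" reliability.
LOW_RELIABILITY = -20.0
HIGH_RELIABILITY = -5.0


def react_to_reliability(reliability):
    """Choose a gesture signalling how well the speech was understood."""
    if reliability <= LOW_RELIABILITY:
        return "cock head"
    if reliability >= HIGH_RELIABILITY:
        return "nod head"
    return None  # Assumption: no gesture for intermediate reliability.


print(react_to_reliability(-30.0))
print(react_to_reliability(-1.0))
```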

[0100] Next, action information indicating the contents of
current or past actions of the robot is supplied from the
action determining mechanism unit 52 to the voice recognizing
unit 50A, as described above, and the voice recognizing unit
50A can be arranged to perform control of the voice
recognizing processing based on the action information. That
is, the action information output from the action determining
mechanism unit 52 is supplied to the characteristics
extracting unit 22 and the matching unit 23, and the
characteristics extracting unit 22 and the matching unit 23
can be arranged to change the processing contents based on
the action information supplied thereto.

[0101] Specifically, as shown in the flowchart in Fig. 10,
upon action instruction information instructing the voice
recognizing processing being transmitted from the action
determining mechanism unit 52, the action instruction
information is received at the voice recognizing unit 50A in
step S11 in the same manner as that of step S1 in Fig. 9, and
the blocks making up the voice recognizing unit 50A are set
to an active state.
[0102] Incidentally, as described above, the blocks making up
the voice recognizing unit 50A may be set to an active state
at all times. In this case, an arrangement may be made, for
example, wherein the processing from step S12 on in Fig. 10
is started at the voice recognizing unit 50A each time the
action information output from the action determining
mechanism unit 52 changes.
[0103] Subsequently, the characteristics extracting unit 22
and the matching unit 23 make reference to the action
information output from the action determining mechanism unit
52 in step S12, and the flow proceeds to step S13. In step
S13, the matching unit 23 sets the word dictionaries to be
used for the above-described score calculating (matching),
based on the action information.
[0104] That is, for example, in the event that the action 
information represents the current action to be "sitting" 
or "lying on side", it is basically inconceivable that the 
user would say, "Sit!" to the robot. Accordingly, the 
matching unit 23 sets the word dictionaries of the dic- 
tionary storing unit 25 so that the word "Sit!" is excluded 
from the object of speech recognition, in the event that 
the action information represents the current action to 
be "sitting" or "lying on side". In this case, no speech 
recognition is made regarding the speech "Sit!". Further,
in this case, the number of words which are the object
of speech recognition decreases, thereby enabling
increased processing speeds and improved recognition
precision. 
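
The exclusion of words that make no sense for the current action can be sketched as a set difference. The vocabulary contents other than "Sit!" and the structure of the exclusion table are assumptions made for this illustration.

```python
# Hypothetical full vocabulary; only "Sit!" is taken from the description.
FULL_VOCABULARY = {"Sit!", "Walk!", "Shake hands!"}

# Commands excluded for a given current action, following the "Sit!" example.
EXCLUDED_BY_ACTION = {
    "sitting": {"Sit!"},
    "lying on side": {"Sit!"},
}


def active_vocabulary(full_vocabulary, current_action):
    """Vocabulary used for matching, given the robot's current action."""
    return set(full_vocabulary) - EXCLUDED_BY_ACTION.get(current_action, set())


print(sorted(active_vocabulary(FULL_VOCABULARY, "sitting")))
print(sorted(active_vocabulary(FULL_VOCABULARY, "walking")))
```

Because the matcher scores fewer candidate words when the robot is sitting, the search is both faster and less likely to confuse similar-sounding words.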

[0105] Following the processing of step S13, the flow
proceeds to step S14, and the characteristics extracting
unit 22 and the matching unit 23 set the parameters to
be used for voice recognition processing (i.e., recognition
parameters) based on the action information.
[0106] That is, in the event that the action information
represents "walking", for example, the characteristics
extracting unit 22 and the matching unit 23 set the
recognition parameters such that priority is given to
precision over processing speed, as compared to cases wherein
the action information represents "sitting" or "lying
prostrate", for example.

[0107] On the other hand, in the event that the action 
information represents "sitting" or "lying prostrate", for 
example, the recognition parameters are set such that 
priority is given to processing speed over precision, as 
compared to cases wherein the action information rep- 
resents "walking", for example. 

[0108] In the event that the robot is walking, the noise
level from the driving of the actuators 3AA 1 through 5A 1
and 5A 2 is higher than in the case of sitting or lying
prostrate, and generally, the precision of voice recognition
deteriorates due to the effects of the noise. Thus, setting
the recognition parameters such that priority is given to
precision over processing speed in the event that the robot
is walking allows deterioration of voice recognition
precision due to the noise to be prevented (reduced).
[0109] On the other hand, in the event that the robot is
sitting or lying prostrate, there is no noise from the above
actuators 3AA 1 through 5A 1 and 5A 2, so there is no
deterioration of voice recognition precision due to the
driving noise. Accordingly, setting the recognition
parameters such that priority is given to processing speed
over precision in the event that the robot is sitting or
lying prostrate allows the processing speed of voice
recognition to be improved, while maintaining a certain level
of voice recognition precision.

[0110] Now, recognition parameters which affect the precision
and processing speed of voice recognition include, for
example, the range of hypotheses used in the event of
restricting the range serving as the object of score
calculation by the beam search method at the matching unit 23
(i.e., the beam width for the beam search), and so forth.
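
The effect of the beam width on precision and processing speed can be illustrated with a small beam search. This is a generic sketch, not the matching unit's actual decoder: the labels, the emission and transition log probabilities, and the beam widths assigned to each action are all hypothetical.

```python
import heapq
import math

# Hypothetical beam widths: wider when walking (precision has priority),
# narrower when sitting or lying prostrate (speed has priority).
BEAM_WIDTH_BY_ACTION = {"walking": 2, "sitting": 1, "lying prostrate": 1}


def beam_search(steps, transitions, beam_width):
    """steps: list of dicts mapping a label to its log probability per step.
    transitions: dict mapping (previous label, label) to a log probability.
    Returns (best label sequence, accumulated score)."""
    beam = [(0.0, [])]
    for emissions in steps:
        expanded = []
        for score, seq in beam:
            prev = seq[-1] if seq else None
            for label, emit_lp in emissions.items():
                trans_lp = transitions.get((prev, label), 0.0)
                expanded.append((score + emit_lp + trans_lp, seq + [label]))
        # Keep only the beam_width highest-scoring hypotheses.
        beam = heapq.nlargest(beam_width, expanded, key=lambda h: h[0])
    best_score, best_seq = beam[0]
    return best_seq, best_score


steps = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"c": math.log(0.5), "d": math.log(0.5)},
]
transitions = {
    ("a", "c"): math.log(0.1), ("a", "d"): math.log(0.1),
    ("b", "c"): math.log(0.9), ("b", "d"): math.log(0.1),
}
wide_seq, wide_score = beam_search(steps, transitions, BEAM_WIDTH_BY_ACTION["walking"])
narrow_seq, narrow_score = beam_search(steps, transitions, BEAM_WIDTH_BY_ACTION["sitting"])
```

In this toy example the narrow beam discards the hypothesis starting with "b" after the first step and therefore misses the best overall sequence, while the wide beam finds it at the cost of scoring more hypotheses.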

[0111] Subsequently, the flow proceeds to step S15, wherein
the voice input to the microphone 15 is taken into the
characteristics extracting unit 22 via the A/D converting
unit 21, and the flow proceeds to step S16. At step S16, the
above-described processing is performed at the
characteristics extracting unit 22 and the matching unit 23
under the settings made in steps S13 and S14, thereby
executing voice recognizing of the voice input to the
microphone 15. Then, the flow proceeds to step S17, and the
phonemics information, pitch information, and reliability
information, which are the voice recognition results obtained
by the processing in step S16, are output to the
emotion/instinct model unit 51 and action determining
mechanism unit 52 as state recognition information, and the
processing ends.
[0112] Upon receiving such state recognition information from
the voice recognizing unit 50A, the emotion/instinct model
unit 51 changes the values of the emotion model and instinct
model as described above based on the state recognition
information, and the action determining mechanism unit 52
determines the next action of the robot.

[0113] Also, the above arrangement involves setting the
recognition parameters such that priority is given to
precision over processing speed in the event that the robot
is walking, since the effects of noise from the driving of
the actuators 3AA 1 through 5A 1 and 5A 2 cause the precision
of voice recognition to deteriorate, thereby preventing
deterioration of voice recognition precision due to the
noise. However, an arrangement may be made wherein, in the
event that the robot is walking, the robot is caused to
temporarily stop in order to perform voice recognition, and
prevention of deterioration of voice recognition precision
can be realized with such an arrangement as well.

[0114] Next, Fig. 11 illustrates a configuration exam- 
ple of the voice synthesizing unit 55 shown in Fig. 3. 

[0115] The action instruction information output by the
action determining mechanism unit 52, which contains the text
that is the object of voice synthesizing, is supplied to the
text generating unit 31, and the text generating unit 31
analyzes the text contained in the action instruction
information, making reference to the dictionary storing unit
34 and analyzing grammar storing unit 35.

[0116] That is, the dictionary storing unit 34 has stored
therein word dictionaries describing part of speech
information, reading, accentuation, and other information for
each word. Also, the analyzing grammar storing unit 35 stores
analyzing syntaxes relating to restrictions of word
concatenation and the like, regarding the words described in
the word dictionaries in the dictionary storing unit 34.
Then, the text generating unit 31 performs morpheme analysis
and grammatical structure analysis of the input text based on
the word dictionaries and analyzing syntaxes, and extracts
information necessary for the rule voice synthesizing
performed by the subsequent rules synthesizing unit 32. Here,
examples of information necessary for rule voice synthesizing
include pause positions, pitch information such as
information for controlling accents and intonation, phonemics
information such as the pronunciation and the like of each
word, and so forth.

[0117] The information obtained at the text generating unit
31 is then supplied to the rules synthesizing unit 32, and at
the rules synthesizing unit 32, voice data (digital data) of
synthesized sound corresponding to the text input to the text
generating unit 31 is generated using the phoneme storing
unit 36.

[0118] That is, phoneme data in the form of CV (Consonant,
Vowel), VCV, CVC, etc., is stored in the phoneme storing unit
36, so the rules synthesizing unit 32 connects the necessary
phoneme data based on the information from the text
generating unit 31, and further adds pauses, accents,
intonation, etc., in an appropriate manner, thereby
generating voice data of synthesized sound corresponding to
the text input to the text generating unit 31.

[0119] This voice data is supplied to the D/A
(Digital/Analog) converting unit 33, and is there subjected
to D/A conversion into analog voice signals. The voice
signals are supplied to the speaker 18, thereby outputting
the synthesized sound corresponding to the text input to the
text generating unit 31.

[0120] The voice synthesizing unit 55 thus configured
receives supply of action instruction information containing
text which is the object of voice synthesizing from the
action determining mechanism unit 52, also receives supply of
the state of emotions and instincts from the emotion/instinct
model unit 51, and further receives supply of action
information from the action determining mechanism unit 52,
and the text generating unit 31 and rules synthesizing unit
32 perform voice synthesizing processing taking the state of
emotions and instincts and the action information into
consideration.
[0121] Now, the voice synthesizing processing performed while
taking the state of emotions and instincts into consideration
will be described, with reference to the flowchart in Fig.
12. At the point that the action determining mechanism unit
52 outputs the action instruction information containing text
which is the object of voice synthesizing to the voice
synthesizing unit 55, the text generating unit 31 receives
the action instruction information in step S21, and the flow
proceeds to step S22. At step S22, the state of emotions and
instincts of the robot is recognized in the text generating
unit 31 and rules synthesizing unit 32 by making reference to
the emotion/instinct model unit 51, and the flow proceeds to
step S23.

[0122] In step S23, at the text generating unit 31, the
vocabulary (speech vocabulary) used for generating text to be
actually output as synthesized sound (hereafter also referred
to as "speech text") from the text contained in the action
instruction information from the action determining mechanism
unit 52 is set based on the emotions and instincts of the
robot, and the flow proceeds to step S24. In step S24, at the
text generating unit 31, speech text corresponding to the
text contained in the action instruction information is
generated using the speech vocabulary set in step S23.
[0123] That is, the text contained in the action instruction
information from the action determining mechanism unit 52
presupposes speech in a standard state of emotions and
instincts, and in step S24 the text is corrected taking into
consideration the state of emotions and instincts of the
robot, thereby generating speech text.

[0124] Specifically, in the event that the text contained in
the action instruction information is "What is it?" for
example, and the emotion state of the robot represents
"angry", the text is generated as speech text of "Yeah,
what?" to indicate anger. Also, in the event that the text
contained in the action instruction information is "Please
stop" for example, and the emotion state of the robot
represents "angry", the text is generated as speech text of
"Quit it!" to indicate anger.
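
The correction of standard text into speech text can be sketched as a lookup keyed by emotion state. The two rewrites for "angry" are the examples given in the text; representing the speech vocabulary as a table, and passing the text through unchanged for other states, are assumptions made for this illustration.

```python
# The "angry" rewrites are the two examples from the description;
# the table structure itself is an assumption.
SPEECH_TEXT_BY_EMOTION = {
    "angry": {
        "What is it?": "Yeah, what?",
        "Please stop": "Quit it!",
    },
}


def generate_speech_text(text, emotion_state):
    """Return the speech text for the given standard text and emotion state."""
    return SPEECH_TEXT_BY_EMOTION.get(emotion_state, {}).get(text, text)


print(generate_speech_text("What is it?", "angry"))
print(generate_speech_text("What is it?", "not angry"))
```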

[0125] Then, the flow proceeds to step S25, where the text
generating unit 31 performs text analysis of the speech text
such as morpheme analysis and grammatical structure analysis,
and generates pitch information such as pitch frequency,
power, duration, etc., serving as information necessary for
performing rule voice synthesizing regarding the speech text.
Further, the text generating unit 31 also generates phonemics
information such as the pronunciation of each word making up
the speech text. Here, in step S25, standard pitch
information is generated for the pitch information of the
speech text.

[0126] Subsequently, in step S26, the text generating unit 31
corrects the pitch information of the speech text generated
in step S25 based on the state of emotions and instincts of
the robot, thereby giving greater emotional expression at the
point of outputting the speech text as synthesized sound.
[0127] Now, the details of the relation between emo- 
tion and speech are described in, e.g., "Conveyance of 
Paralinguistic Information by Speech: From the Per- 
spective of Linguistics", MAEKAWA, Acoustical Society 
of Japan 1997 Fall Meeting Papers Vol. 1-3-10, pp. 
381-384, September 1997, etc. 

[0128] The phonemics information and pitch information of the
speech text obtained at the text generating unit 31 are
supplied to the rules synthesizing unit 32, and in step S27,
at the rules synthesizing unit 32, rule voice synthesizing is
performed following the phonemics information and pitch
information, thereby generating digital data of the
synthesized sound of the speech text. Now, at the rules
synthesizing unit 32 also, pitch such as the position of
pausing, the position of accent, intonation, etc., of the
synthesized sound is changed based on the state of emotions
and instincts of the robot, so as to appropriately express
that state.

[0129] The digital data of the synthesized sound
obtained at the rules synthesizing unit 32 is supplied to the
D/A converting unit 33. In step S28, at the D/A converting
unit 33, digital data from the rules synthesizing unit
32 is subjected to D/A conversion, and supplied to the
speaker 18, thereby ending processing. Thus, synthesized
sound of the speech text which has pitch reflecting
the state of emotions and instincts of the robot is output
from the speaker 18.
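
The correction of steps S25 and S26 can be pictured with a minimal
sketch. It assumes that pitch information is held as pitch frequency,
power and duration, and that the emotion-based correction is a simple
scaling. The class name, function name and all factor values are
hypothetical; the specification gives no concrete values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PitchInfo:
    pitch_hz: float   # pitch frequency
    power: float      # relative loudness, 1.0 = standard
    duration: float   # relative duration, 1.0 = standard


# Hypothetical scaling factors (pitch, power, duration) per emotion state.
EMOTION_FACTORS = {
    "not angry": (1.0, 1.0, 1.0),
    "angry": (1.25, 1.5, 0.75),
}


def correct_for_emotion(standard: PitchInfo, emotion: str) -> PitchInfo:
    """Scale standard pitch information according to the emotion state."""
    pitch_f, power_f, dur_f = EMOTION_FACTORS.get(emotion, (1.0, 1.0, 1.0))
    return PitchInfo(
        standard.pitch_hz * pitch_f,
        standard.power * power_f,
        standard.duration * dur_f,
    )
```

An unknown emotion state leaves the standard values unchanged, so
the sketch degrades to the uncorrected output of step S25.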

[0130] Next, the voice synthesizing processing which 
is performed taking into account the action information 
will be described with reference to the flowchart in Fig. 
13. 

[0131] At the point that the action determining mech- 
anism unit 52 outputs the action instruction information 
containing text which is the object of voice synthesizing 
to the voice synthesizing unit 55, the text generating unit 
31 receives the action instruction information in step 
S31, and the flow proceeds to step S32. At step S32,
the current action of the robot is confirmed in the text 
generating unit 31 and rules synthesizing unit 32 by 
making reference to the action information output by the 
action determining mechanism unit 52, and the flow pro- 
ceeds to step S33. 

[0132] In step S33, at the text generating unit 31, the
vocabulary (speech vocabulary) used for generating 
speech text is set from the text contained in the action 
instruction information from the action determining 
mechanism unit 52, based on action information, and 
speech text corresponding to the text contained in the 
action instruction information is generated using the 
speech vocabulary. 

[0133] Then the flow proceeds to step S34, where the text
generating unit 31 performs morpheme analysis and
grammatical structure analysis of the speech text, and
generates pitch information such as pitch frequency,
power, duration, etc., serving as information necessary
for performing rule voice synthesizing regarding the
speech text. Further, the text generating unit 31 also
generates phonemics information such as the pronunciation
of each word making up the speech text. Here,
in step S34 as well, standard pitch information is generated
for the pitch information of the speech text, in the
same manner as with step S25 in Fig. 12.
[0134] Subsequently, in step S35, the text generating
unit 31 corrects the pitch information of the speech text
generated in step S34 based on the action information.
[0135] That is, in the event that the robot is walking,
for example, there is noise from the driving of the
actuators 3AA1 through 5A1 and 5A2 as described above. On
the other hand, in the event that the robot is sitting or
lying prostrate, there is no such noise. Accordingly, the
synthesized sound is harder to hear in the event that the
robot is walking, in comparison to cases wherein the
robot is sitting or lying prostrate.

[0136] Thus, in the event that the action information
indicates the robot is walking, the text generating unit
31 corrects the pitch information so as to slow the
speech speed of the synthesized sound or increase the
power thereof, thereby making the synthesized sound
more readily understood.
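
The action-based correction of step S35 can be sketched as follows.
This is an illustration only: the dictionary keys, the function name
and the factors 1.25 and 1.5 are assumptions, not values from the
specification.

```python
def correct_for_action(pitch_info: dict, action: str) -> dict:
    """Return a copy of the pitch information corrected for the action.

    While walking, actuator noise masks the voice, so the speech is
    slowed (longer duration) and made louder (more power).
    """
    corrected = dict(pitch_info)
    if action == "walking":
        corrected["duration"] *= 1.25  # hypothetical: 25% slower speech
        corrected["power"] *= 1.5      # hypothetical: 50% more power
    return corrected
```

For actions without actuator noise, such as sitting or lying prostrate,
the standard pitch information is returned unchanged.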
[0137] In other arrangements, correction may be
made in step S35 such that the pitch frequency value
differs depending on whether the action information
indicates that the robot is on its side or standing.
[0138] The phonemics information and pitch information
of the speech text obtained at the text generating
unit 31 are supplied to the rules synthesizing unit 32, and
in step S36, at the rules synthesizing unit 32, rule voice
synthesizing is performed following the phonemics
information and pitch information, thereby generating
digital data of the synthesized sound of the speech text.
Now, at the rules synthesizing unit 32 also, the position
of pausing, the position of accent, intonation, etc., of the
synthesized sound, is changed as necessary, at the time
of rule voice synthesizing.
[0139] The digital data of the synthesized sound
obtained at the rules synthesizing unit 32 is supplied to the
D/A converting unit 33. In step S37, at the D/A converting
unit 33, digital data from the rules synthesizing unit
32 is subjected to D/A conversion, and supplied to the
speaker 18, thereby ending processing.

[0140] Incidentally, in the event of generating synthesized
sound at the voice synthesizing unit 55 taking into
consideration the state of emotions and instincts, and
the action information, the output of such synthesized
sound and the actions of the robot may be synchronized.

[0141] That is, for example, in the event that the emotion
state represents "not angry", and the synthesized
sound "What is it?" is to be output taking the state of
emotion into consideration, the robot may be made to
face the user in a manner synchronous with the output
of the synthesized sound. On the other hand, for example,
in the event that the emotion state represents "angry",
and the synthesized sound "Yeah, what?" is to be
output taking the state of emotion into consideration, the
robot may be made to face the other way in a manner
synchronous with the output of the synthesized sound.
[0142] Also, an arrangement may be made wherein,
in the event of output of the synthesized sound "What
is it?", the robot is made to act at normal speed, and
wherein in the event of output of the synthesized sound
"Yeah, what?", the robot is made to act at a speed slower
than normal, in a sullen and unwilling manner. In this
case, the robot can express emotions to the user with
both motions and synthesized sound.
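
The pairing of synthesized sound and motion described above can be
sketched as a lookup keyed by the emotion state. The table layout,
the function name and the speed factor 0.5 are hypothetical; only the
two utterances and the facing behavior come from the description.

```python
# Emotion state -> (text to synthesize, motion, motion speed factor).
# A speed factor of 1.0 means normal speed; 0.5 is an assumed value
# standing in for "slower than normal".
RESPONSES = {
    "not angry": ("What is it?", "face the user", 1.0),
    "angry": ("Yeah, what?", "face the other way", 0.5),
}


def plan_response(emotion: str) -> dict:
    """Return the utterance and the motion to be started together."""
    text, motion, speed = RESPONSES[emotion]
    return {"text": text, "motion": motion, "speed": speed}
```

A caller would start the motion and the voice output from the same
plan, which is what keeps the two synchronized.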

[0143] Further, at the action determining mechanism
unit 52, the next action is determined based on an action
model represented by a finite automaton such as shown
in Fig. 6, and the contents of the text output as synthesized
sound can be correlated with the transition of state
in the action model in Fig. 6.

[0144] That is, for example, in the event of making
transition from the state corresponding to the action
"sitting" to the state corresponding to the action "standing",
a text such as "Here goes!" can be correlated thereto.
In this case, in the event of the robot making transition
from a sitting position to a standing position, the
synthesized sound "Here goes!" can be output in a manner
synchronous with the transition in position.
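
Correlating text with a state transition can be sketched as a table
keyed by (current state, next state). The table and function names
are hypothetical, and the action model of Fig. 6 is reduced here to a
bare pair of state names.

```python
from typing import Optional, Tuple

# Text correlated with specific transitions of the action model.
TRANSITION_TEXT = {
    ("sitting", "standing"): "Here goes!",
}


def make_transition(current: str, target: str) -> Tuple[str, Optional[str]]:
    """Return the new state and the text to synthesize, if any."""
    return target, TRANSITION_TEXT.get((current, target))
```

Transitions with no entry in the table return None for the text, so
no synthesized sound accompanies them.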
[0145] As described above, a robot with a high enter- 
tainment nature can be provided by controlling the voice 
synthesizing processing and voice recognizing process- 
ing, based on the state of the robot. 
[0146] Next, Fig. 14 illustrates a configuration example
of the image recognizing unit 50B making up the
sensor input processing unit 50 shown in Fig. 3.
[0147] Image signals output from the CCD camera 16
are supplied to the A/D converting unit 41, and there
subjected to A/D conversion, thereby becoming digital
image data. This digital image data is supplied to the
image processing unit 42. At the image processing unit
42, predetermined image processing such as DCT (Discrete
Cosine Transform) and the like, for example, is
performed on the image data from the A/D converting unit
41, and the result is supplied to the recognition collation unit
43.

[0148] The recognition collation unit 43 calculates the
distance between each of multiple image patterns
stored in the image pattern storing unit 44 and the output
of the image processing unit 42, and detects the image
pattern with the smallest distance. Then, the recognition
collation unit 43 recognizes the image taken with
the CCD camera 16, and outputs the recognition results
as state recognition information to the emotion/instinct
model unit 51 and action determining mechanism unit
52, based on the detected image pattern.
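
The collation described above amounts to a nearest-pattern search.
The following minimal sketch assumes that both the stored image
patterns and the output of the image processing unit are feature
vectors of equal length, and that the distance is Euclidean; the
specification does not state which distance measure is used, and the
function and pattern names are hypothetical.

```python
import math
from typing import Dict, Sequence


def collate(features: Sequence[float],
            patterns: Dict[str, Sequence[float]]) -> str:
    """Return the name of the stored pattern closest to the features."""
    return min(patterns, key=lambda name: math.dist(features, patterns[name]))
```

For example, with stored patterns named "ball" and "face", the input
is recognized as whichever of the two lies at the smaller distance.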
[0149] Now, the configuration shown in the block diagram
in Fig. 3 is realized by the CPU 10A executing control
programs, as described above. Now, taking only the
power of the CPU 10A (hereafter also referred to simply
as "CPU power") into consideration as a resource
necessary for realizing the voice recognizing unit 50A, the
CPU power is uniquely determined by the hardware
employed for the CPU 10A, and the processing amount (the
processing amount per unit time) which can be executed
by the CPU power is also uniquely determined.
[0150] On the other hand, in the processing to be ex- 
ecuted by the CPU 10A, there is processing which 
should be performed with priority over the voice recog- 
nition processing (hereafter also referred to as "priority 
processing"), and accordingly, in the event that the load 
of the CPU 10A for priority processing increases, the 
CPU power which can be appropriated to voice recog- 
nition processing decreases. 

[0151] That is, representing the load on the CPU 10A
regarding priority processing as x%, and representing
the CPU power which can be appropriated to voice
recognition processing as y%, the relation between x and
y is represented by the expression

x + y = 100%

and is as shown in Fig. 15.

[0152] Accordingly, in the event that the load for priority
processing is 0%, 100% of the CPU power can be
appropriated to voice recognition processing. Also, in
the event that the load regarding priority processing is
S% (0 < S < 100), (100 - S)% of the CPU power can be
appropriated. Also, in the event that the load for priority
processing is 100%, no CPU power can be appropriated
to voice recognition processing.

[0153] Now, for example, in the event that the robot is
walking, and CPU power appropriated to
the processing for the action of "walking" (hereafter also
referred to as "walking processing") is insufficient, the
walking speed becomes slow, and in the worst case,
the robot may stop walking. Such slowing or stopping
while walking appears unnatural to the user, so there is the
need to prevent such a state if at all possible, and
accordingly, it can be said that the walking processing
performed while the robot is walking must be performed
with priority over the voice recognition processing.
[0154] That is, in the event that the processing currently
being carried out is obstructed by voice recognition
processing being performed and the movement of
the robot is no longer smooth due to this, the user will
sense that this is unnatural. Accordingly, it can be said
that basically, the processing being currently performed
must be performed with priority over the voice recognition
processing, and that voice recognition processing
should be performed within a range so as to not obstruct
the processing being currently performed.
[0155] To this end, the action determining mechanism
unit 52 is arranged so as to recognize the action being
currently taken by the robot, and to control voice
recognition processing by the voice recognizing unit 50A,
based on the load corresponding to the action.
[0156] That is, as shown in the flowchart in Fig. 16, in
step S41, the action determining mechanism unit 52
recognizes the action being taken by the robot, based on
the action model which it itself manages, and the flow
proceeds to step S42. In step S42, the action determining
mechanism unit 52 recognizes the load regarding
the processing for continuing the current action
recognized in step S41 in the same manner (i.e., maintaining
the action).

[0157] Now, the load corresponding to the processing
for continuing the current action in the same manner can
be obtained by predetermined calculations. Also, the
load can be obtained by preparing beforehand
a table correlating actions and estimated CPU power for
performing processing corresponding to the actions,
and making reference to the table. Note that less
processing amount is required for the table than for
calculation.
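
The table-based approach, combined with the relation x + y = 100%
given earlier, can be sketched as follows. The table contents and
the names are hypothetical; the specification does not give load
figures for any action.

```python
# Hypothetical table: estimated CPU load (%) for continuing each action.
ACTION_LOAD_PERCENT = {
    "walking": 70.0,
    "sitting": 10.0,
    "lying prostrate": 5.0,
}


def available_for_recognition(action: str) -> float:
    """Return y, the CPU power (%) left for voice recognition.

    x is looked up from the table rather than calculated, and
    y follows from x + y = 100.
    """
    x = ACTION_LOAD_PERCENT[action]
    return 100.0 - x
```

A single dictionary lookup replaces whatever calculation would
otherwise estimate the load, which is the saving the table offers.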

[0158] After obtaining the load corresponding to
the processing for continuing the current action in the
same manner, the flow proceeds to step S43, and the
action determining mechanism unit 52 obtains the CPU
power which can be appropriated to voice recognizing
processing, based on the load, from the relationship
shown in Fig. 15. Further, the action determining
mechanism unit 52 performs various types of control relating
to voice recognizing processing based on the CPU power
which can be appropriated to the voice recognizing
processing, after which the flow returns to step S41, and
subsequently the same processing is repeated.
[0159] That is, the action determining mechanism unit
52 changes the word dictionaries used for voice recognizing
processing, based on the CPU power which can
be appropriated to the voice recognizing processing.
Specifically, in the event that sufficient CPU power can
be appropriated to the voice recognizing processing,
settings are made such that dictionaries with a great
number of words registered therein are used for voice
recognizing processing. Also, in the event that sufficient
CPU power cannot be appropriated to the voice recognizing
processing, settings are made such that dictionaries
with few words registered therein are used for
voice recognizing.

[0160] Further, in the event that practically no CPU
power can be appropriated to voice recognizing
processing, the action determining mechanism unit 52
puts the voice recognizing unit 50A to sleep (a state
wherein no voice recognizing processing is performed).
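
The dictionary switching and the sleep state can be sketched as a
selection on the available CPU power. The thresholds, the word lists
and the function name are all hypothetical; the specification does
not define what counts as "sufficient" or "practically no" CPU power.

```python
from typing import List, Optional

LARGE_DICTIONARY = ["sit", "stand", "walk", "lie down", "shake hands", "fetch"]
SMALL_DICTIONARY = ["sit", "stand"]


def select_dictionary(available_percent: float,
                      sufficient: float = 50.0,
                      minimum: float = 5.0) -> Optional[List[str]]:
    """Pick the word dictionary for the available CPU power.

    Returns None when the recognizer should be put to sleep.
    """
    if available_percent < minimum:
        return None                  # practically no CPU power: sleep
    if available_percent >= sufficient:
        return LARGE_DICTIONARY      # sufficient CPU power
    return SMALL_DICTIONARY          # insufficient CPU power
```

A smaller dictionary reduces the number of candidate words to be
matched, which is why it suits the case of limited CPU power.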
[0161] Also, the action determining mechanism unit
52 causes the robot to take actions corresponding to the
CPU power which can be appropriated to voice recognizing
processing.

[0162] That is, in the event that practically no CPU
power can be appropriated to voice recognizing
processing, or in the event that sufficient CPU power
cannot be appropriated thereto, no voice recognizing
processing is performed, or the voice recognizing
precision and processing speed may deteriorate, giving the
user an unnatural sensation.

[0163] Accordingly, in the event that practically no
CPU power can be appropriated to voice recognizing
processing, or in the event that sufficient CPU power
cannot be appropriated thereto, the action determining
mechanism unit 52 causes the robot to take listless
actions or actions such as cocking the head, thereby
notifying the user that voice recognition is difficult.
[0164] Also, in the event that sufficient CPU power
can be appropriated to voice recognizing processing,
the action determining mechanism unit 52 causes the
robot to take energetic actions or actions such as nodding
the head, thereby notifying the user that voice
recognition is sufficiently available.
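
The choice of notifying action can be sketched in the same way. The
threshold of 50% and the function name are assumptions made for
illustration; only the two kinds of action come from the description.

```python
def notification_action(available_percent: float,
                        sufficient: float = 50.0) -> str:
    """Choose the action that tells the user how well the robot can listen."""
    if available_percent >= sufficient:
        return "nod the head"    # voice recognition sufficiently available
    return "cock the head"       # voice recognition difficult
```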

[0165] In addition to the robot taking such actions
as described above to notify the user whether voice
recognition processing is available or not, arrangements
may be made wherein special sounds such as
"beep-beep-beep" or "tinkle-tinkle-tinkle", or predetermined
synthesized sound messages, are output from the
speaker 18.

[0166] Also, in the event that the robot has a liquid
crystal panel, the user can be notified regarding whether
voice recognition processing is available or not by
displaying predetermined messages on the liquid crystal
panel. Further, in the event that the robot has a mechanism
for expressing facial expressions such as blinking
and so forth, the user can be notified regarding whether
voice recognition processing is available or not by such
changes in facial expressions.

[0167] Note that while only the CPU power has been
dealt with in the above case, other resources for
voice recognition processing (e.g., available space on
the memory 10B, etc.) may also be the object of such
managing.

[0168] Further, in the above, description has been
made with focus on the relation between voice recognition
processing at the voice recognizing unit 50A and
other processing, but the same can be said regarding
the relation between image recognizing processing at
the image recognizing unit 50B and other processing,
voice synthesizing processing at the voice synthesizing
unit 55 and other processing, and so forth.
[0169] The above has been a description of an ar- 
rangement wherein the present invention has been ap- 
plied to an entertainment robot (i.e., a robot serving as 
a pseudo pet), but the present invention is by no means 
restricted to this application; rather, the present inven- 
tion can be widely applied to various types of robots, 
such as industrial robots, for example. 
[0170] Further, in the present embodiment, the
above-described series of processing is performed by
the CPU 10A executing programs, but the series of
processing may be carried out by dedicated hardware
for each.

[0171] Also, in addition to storing the programs on the
memory 10B (see Fig. 2) beforehand, the programs may
be temporarily or permanently stored (recorded) on
removable recording media such as floppy disks,
CD-ROMs (Compact Disk Read-Only Memory), MO
(Magneto-Optical) disks, DVDs (Digital Versatile Disk),
magnetic disks, semiconductor memory, etc. Such removable
recording media may be provided as so-called packaged
software, so as to be installed in the robot (memory
10B).

[0172] Also, in addition to installing the programs from 
removable recording media, arrangements may be 
made wherein the programs are transferred from a 
download site in a wireless manner via a digital broad- 
cast satellite, or by cable via networks such as LANs 
(Local Area Networks) or the Internet, and thus installed 
to the memory 10B.

[0173] In this case, in the event that a newer version 
of the program is released, the newer version can be 
easily installed to the memory 10B. 
[0174] Now, in the present specification, the processing
steps describing the program for causing the CPU
10A to perform various types of processing do not
necessarily need to be processed in the time-sequence
following the order described in the flowcharts; rather, the
present specification includes arrangements wherein
the steps are processed in parallel or individually (e.g.,
parallel processing or processing by objects).
[0175] Also, the programs may be processed by a single
CPU, or the processing thereof may be dispersed
between multiple CPUs and thus processed.
[0176] In so far as the embodiments of the invention
described above are implemented, at least in part, using
software-controlled data processing apparatus, it will be
appreciated that a computer program providing such
software control and a storage medium by which such
a computer program is stored are envisaged as aspects
of the present invention.

Claims 

1. A voice processing device built into a robot, said
voice processing device comprising:

voice processing means for processing voice;
and

control means for controlling voice processing
by said voice processing means, based on the
state of said robot.

2. A voice processing device according to Claim 1,
wherein said control means control said voice
processing based on the state of actions, emotions or
instincts of said robot.

3. An voice processing device according to Claim 1, 
wherein said voice processing means comprises 35 
voice synthesizing means for performing voice syn- 
thesizing processing and outputting synthesized 
sound; 

and wherein said control means control the voice 
synthesizing processing by said voice synthesizing 40 
means, based on the state of said robot. 

4. A voice processing device according to Claim 3,
wherein said control means control phonemics
information and pitch information output by said voice
synthesizing means.

5. A voice processing device according to Claim 3,
wherein said control means control the speech
speed or volume of synthesized sound output by
said voice synthesizing means.

6. A voice processing device according to Claim 1,
wherein said voice processing means extract the
pitch information or phonemics information
of the input voice;

and wherein the emotion state of said robot is
changed based on said pitch information or phonemics
information, or said robot takes actions corresponding
to said pitch information or phonemics information.

7. A voice processing device according to Claim 1,
wherein said voice processing means comprises 
voice recognizing means for recognizing input 
voice; 

and wherein said robot takes actions corresponding 
to the reliability of the voice recognition results out- 
put from said voice recognizing means, or the emo- 
tion state of said robot is changed based on said 
reliability. 

8. A voice processing device according to Claim 1,
wherein said control means recognizes the action 
which said robot is taking, and controls voice 
processing by said voice processing means based 
on the load regarding that action. 

9. A voice processing device according to Claim 8,
wherein said robot takes actions corresponding to 
resources which can be appropriated to voice 
processing by said voice processing means. 

10. A voice processing method for a voice processing
device built into a robot, said method comprising:

a voice processing step for processing voice;
and 

a control step for controlling voice processing 
in said voice processing step, based on the 
state of said robot. 

11. A recording medium recording programs to be ex- 
ecuted by a computer, for causing a robot to perform 
voice processing, said program comprising: 

a voice processing step for processing voice;
and 

a control step for controlling voice processing 
in said voice processing step, based on the 
state of said robot. 